Abstract

In this paper we present a novel non-parametric method of simplifying piecewise linear
curves, and we apply this method as a statistical approximation of structure within sequential
data in the plane. We consider the problem of minimizing the average length of sequences of
consecutive input points that lie on any one side of the simplified curve. Specifically, given a
sequence $P$ of $n$ points in the plane that determine a simple polygonal chain consisting of $n - 1$
segments, we describe algorithms for selecting an ordered subset $Q \subset P$ (including the first
and last points of $P$) that determines a second polygonal chain to approximate $P$, such that
the number of crossings between the two polygonal chains is maximized, and the cardinality
of $Q$ is minimized among all such maximizing subsets of $P$. Our algorithms have respective
running times $O(n^2 \log n)$ when $P$ is monotonic and $O(n^2 \log^2 n)$ when $P$ is an arbitrary simple
polyline. Finally, we examine the application of our algorithms iteratively in a bootstrapping
technique to define a smooth robust non-parametric approximation of the original sequence.






1 Introduction 

Given a simple polygonal chain $P$ (a polyline) defined by a sequence of points $(p_1, p_2, \ldots, p_n)$ in the
plane, the polyline simplification problem is to produce a simplified polyline $Q = (q_1, q_2, \ldots, q_k)$,
where $k < n$. The polyline $Q$ represents an approximation of $P$ that optimizes one or more
objectives evaluated as functions of $P$ and $Q$. For $P$ to be simple, the points $p_1, \ldots, p_n$ must be
distinct and $P$ cannot intersect itself.

Motivation for studying polyline simplification comes from the fields of computer graphics and
cartography, where simplification is used to render vector-based features such as streets, rivers, or
coastlines onto a screen or a map at appropriate resolution with acceptable error [1], as well as in
problems involving computer animation, pattern matching and geometric hashing (see the survey by Alt
and Guibas for details [3]). Our present work removes the arbitrary parameter previously required
to describe acceptable error between $P$ and $Q$, and provides a simplification method that is robust
to some forms of noise.

jason_morrison@umanitoba.ca and mskala@cs.umanitoba.ca

Typical polyline simplification algorithms require that the "distance" between two polylines be
measured using a function denoted here by $\zeta(P, Q)$. The specific measure of interest differs depending
on the focus of the particular problem or article; however, three measures are popular: Chebyshev
error $\zeta_C$, Hausdorff distance $\zeta_H$, and Fréchet distance $\zeta_F$. In informal terms, the Chebyshev error is
the maximum absolute difference between y-coordinates of $P$ and $Q$ (maximum residual); the
symmetric Hausdorff distance is the distance between the most isolated point of $P$ or $Q$ with respect
to the other polyline; and the Fréchet distance is more complicated, being the shortest possible
maximum distance between two particles each moving forward along $P$ and $Q$. Alt and Guibas
give more formal definitions [3]. We define a new measure of quality or similarity, to be maximized,
rather than using an "error" to be minimized. Our crossing measure is a combinatorial description
of how well $Q$ approximates $P$. It is invariant under a variety of geometric transformations of the
polylines, and is often robust to uncertainty in the locations of individual points.

Previous work on polyline simplification is generally divided into four categories depending on 
what property is being optimized and what restrictions are placed on Q |]. Problems can be 
classified as those that either require an approximating polyline $Q$ having the minimum number of
segments (minimizing $|Q|$) for a given acceptable error $\zeta(P, Q) < \varepsilon$, or a $Q$ with minimum error
$\zeta(P, Q)$ for a given value of $|Q|$. These are called min-# problems and min-$\varepsilon$ problems respectively.
These two types of problems are each further divided into "restricted" problems, where the points
of $Q$ are required to be a subset of those in $P$ and to include the first and last points of $P$ ($q_1 = p_1$
and $q_k = p_n$), and "unrestricted" problems, where the points of $Q$ may be arbitrary points on the
plane. Under this classification, the polyline simplification $Q$ we examine is a restricted min-#
problem for which a subset of points of $P$ is selected (including $p_1$ and $p_n$), where the objective
measure $\zeta(P, Q)$ is the number of crossings between $P$ and $Q$, and an optimal simplification first
maximizes (rather than minimizes) the crossing number and then has a minimum $|Q|$ given the
maximum crossing number.

While the restricted min-# problems find the smallest sized approximation within a given error
$\varepsilon$, an earlier approach was to find any approximation within the given error. The cartographers
Douglas and Peucker [7] developed a heuristic algorithm where an initial (single segment)
approximation was evaluated and the furthest point was then added to the simplification. This technique
remained inefficient until a series of papers by Hershberger and Snoeyink concluded that the problem
could be solved in $O(n \log^* n)$ time and linear space [5].

The most relevant previous literature is on restricted min-# problems. Imai and Iri [9] presented
an early solution to the restricted polyline simplification problem using $O(n^3)$ time and $O(n)$ space.
The version they study optimizes $k = |Q|$ while maintaining that the Hausdorff metric between $Q$
and $P$ is less than the parameter $\varepsilon$. Their algorithm was subsequently improved by Melkman and
O'Rourke [10] to $O(n^2 \log n)$ time and then by Chan and Chin [3] to $O(n^2)$ time. Subsequently,
Agarwal and Varadarajan [5] changed the approach from finding a shortest path in an explicitly
constructed graph to an implicit method that runs in $O(f(\delta)\, n^{4/3+\delta})$ time. Agarwal and Varadarajan
used the $L_1$ Manhattan and Chebyshev metrics instead of the previous works' Hausdorff metric.
Finally, Agarwal et al. study a variety of metrics and give approximations of the min-# problem in
$O(n)$ or $O(n \log n)$ time.

Our algorithm for minimizing $|Q|$ while optimizing our non-parametric quality measure requires
$O(n^2 \log n)$ time when $P$ is monotonic in $x$, or $O(n^2 \log^2 n)$ time when $P$ is a non-monotonic simple
polyline on the plane, both in $O(n)$ space. The near-quadratic times are remarkably similar to the
optimal times achieved in the parametric version of the problem using Hausdorff distance [1], [3],
suggesting the possibility that the problems may have similar complexities.

In the next section, we define the crossing measure $\chi(Q, P)$ and relate the concepts and
properties of $\chi(Q, P)$ to previous work in both polygonal curve simplification and robust approximation.
In Section 3 we describe our algorithms to compute simplifications of monotonic and non-monotonic
simple polylines that maximize $\chi(Q, P)$. Section 4 presents our results in applying the method to
x-monotonic polylines that model 2-D functional (e.g., measured) data and describes the use of this
simplification method to approximate "shape" and "noise" without assuming a parametric model
for either.



2 Crossing Measure 

The crossing measure $\chi(Q, P)$ is defined for a sequence of $n$ distinct points $P = (p_1, p_2, \ldots, p_n)$ and
a subsequence of $k$ distinct points $Q \subset P$, $Q = (q_1, q_2, \ldots, q_k)$, with the same first and last values:
$q_1 = p_1$ and $q_k = p_n$. For each $p_i$ let $(x_i, y_i) = p_i \in \mathbb{R}^2$. To understand the crossing measure it
is necessary to introduce the idea of left and right sidedness of a point relative to a directed line
segment. A point $p_j$ is on the left side of a segment $S_{i,i+1} = [p_i, p_{i+1}]$ if the signed area of the
triangle formed by the points $p_i, p_{i+1}, p_j$ is positive. Correspondingly, $p_j$ is on the right side of the
segment if the signed area is negative. The three points are collinear if the area is zero.
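
The sidedness test above can be illustrated with a minimal Python sketch. The function names `signed_area2` and `side` are hypothetical, chosen here for illustration, and points are assumed to be `(x, y)` tuples. The test is exact for integer coordinates; with floating-point input, near-collinear triples may be misclassified.

```python
def signed_area2(a, b, c):
    """Twice the signed area of triangle (a, b, c).

    Positive when c lies to the left of the directed segment a -> b,
    negative when it lies to the right, zero when the points are collinear.
    """
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def side(a, b, c):
    """Classify point c relative to the directed segment a -> b."""
    area = signed_area2(a, b, c)
    if area > 0:
        return "left"
    if area < 0:
        return "right"
    return "collinear"
```

Only the sign of the area is used, so the factor of two is harmless and no division is needed.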

For any endpoint $q_i$ of a segment in $Q$ it is possible to determine the side of $P$ on which $q_i$
lies. Since $Q$ is a polyline using a subset of the points defining $P$, for every segment $S_{i,i+1}$ there
exists a corresponding segment $S_{\pi(j),\pi(j+1)}$ such that $\pi(j) \le i < i + 1 \le \pi(j+1)$, where $\pi(j)$
denotes the index in $P$ of the point $q_j$. The endpoints of
$S_{\pi(j),\pi(j+1)}$ are given a side based on $S_{i,i+1}$ and vice versa. Two segments intersect if they share a
point. Such a point is interior to both segments if and only if both segments change sides with respect
to each other; otherwise the intersection is at an endpoint, and at least one endpoint is collinear to the other
segment [11, p. 566]. The crossing measure $\chi(Q, P)$ is the number of times that $Q$ changes sides
from properly left to properly right of $P$ due to an intersection between the polylines. A single
crossing can be generated by any of five cases listed below (see Figure 1):

1. A segment of Q intersects P at a point distinct from any endpoints; 

2. two consecutive segments of P meet and cross at a point interior to a segment of Q; 

3. one or more consecutive segments of P are collinear to the interior of a segment of Q with 
the previous and following segments of P on opposite sides of that segment of Q; 

4. two consecutive segments of P share their common point with two consecutive segments of Q 
and form a crossing; or 

5. in a generalization of the previous case, instead of being a single point the intersection com- 
prises one or more sequential segments of P and possibly Q that are collinear or identical. 



In Section 2.1 we discuss how to compute the crossings for the first three cases, which are all 
cases where crossings involve only one segment of Q. The remaining cases involve more than one 
segment of Q, because an endpoint of one segment of Q or even some entire segments of Q are 
coincident with one or more segments of P; those cases are discussed in Section 2.2.

Figure 1: Examples of the five cases generating a single crossing.

Figure 2: Crossings are indicated with a square and false crossings are marked with a red x. Crossings
are only counted when a segment intersects the portion of the chain between its own endpoints.

In the case where the x-coordinates of $P$ are monotonic, $P$ describes a function $Y$ of $x$ and $Q$
is an approximation $\hat{Y}$ of that function. The signs of the residuals $r = (r_1, r_2, \ldots, r_n) = Y - \hat{Y}$
are computed at the x-coordinates of $P$ and are equivalent to the sidedness described above. The
crossing number is the number of proper sign changes in the sequence of residuals. The resulting
simplification maximizes the likelihood that adjacent residuals would have different signs, while
minimizing the number of original data points retained conditional on that number of sign changes.
Note that if the residuals were independently and identically selected at random from a distribution with
median zero, then any adjacent residuals in the sequence $(r_1, r_2, \ldots, r_n)$ would have different signs
with probability 1/2.
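
For the x-monotonic case, the residual-based view of the crossing number can be sketched as follows. This is an illustrative sketch, not the batched algorithm of the paper: the names `residuals` and `crossing_number` are hypothetical, `P` is assumed to be a list of `(x, y)` tuples with strictly increasing x, and `q_idx` is assumed to be the increasing list of indices of `P` retained in `Q`, starting at 0 and ending at `len(P) - 1`. Zero residuals are skipped, which is one way to count only proper sign changes.

```python
def residuals(P, q_idx):
    """Residuals y - y_hat of every point of P against the polyline through P[q_idx]."""
    res = []
    for a, b in zip(q_idx, q_idx[1:]):
        (xa, ya), (xb, yb) = P[a], P[b]
        for k in range(a, b):  # b is handled as the start of the next segment
            x, y = P[k]
            y_hat = ya + (yb - ya) * (x - xa) / (xb - xa)
            res.append(y - y_hat)
    res.append(0.0)  # the last point of P is always a vertex of Q
    return res


def crossing_number(res):
    """Number of proper sign changes, ignoring zero residuals."""
    signs = [1 if r > 0 else -1 for r in res if r != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)
```

For example, with `P = [(0, 0), (1, 1), (2, -1), (3, 1), (4, 0)]` and `q_idx = [0, 4]`, the nonzero residuals alternate in sign three times in a row, giving two sign changes.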

2.1 Counting Crossings With a Segment 

To compute a simplification $Q$ with optimal crossing number for a given $P$, we consider the optimal
numbers of crossings for segments of $P$ and combine them in a dynamic programming algorithm.
Starting from a point $p_i$ we compute optimal crossing numbers for each of the $n - i$ segments that
start at point $p_i$ and end at some $p_j$ with $i < j \le n$. Computing all $n - i$ optimal crossing numbers
for a given $p_i$ simultaneously in a single pass is more efficient than computing them for each $(p_i, p_j)$
pair separately. These batched computations are performed for each $p_i$ and the results used to find
$Q$.

To compute a single batch we will consider the angular order of points in $P_{i+1,n} = \{p_{i+1}, \ldots, p_n\}$
with respect to $p_i$. Let $\rho_i(j)$ be a function on the indices representing the clockwise angular order of
points $p_j$ within this set, such that $\rho_i(j) = 1$ for all $p_j$ having the smallest clockwise angle measured
from the vertical line passing through $p_i$, and $\rho_i(j) \le \rho_i(k)$ if and only if this angle for $p_j$ is less
than or equal to the corresponding angle for $p_k$. See Figure 3. Using this angular ordering we
partition $P_{i+1,n}$ into chains and process the batch of crossing number problems as discussed below.
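
The angular order can be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, indices are 0-based, ties are resolved with dense ranks (points at the same angle share a rank), and angles are compared as floating-point values, so points that are exactly on a common ray may not be detected as ties in general.

```python
import math


def clockwise_angle_from_vertical(pi, pj):
    """Clockwise angle in [0, 2*pi) from the upward vertical ray at pi to pj."""
    dx, dy = pj[0] - pi[0], pj[1] - pi[1]
    # atan2(dx, dy) measures from the +y axis toward the +x axis, i.e. clockwise.
    return math.atan2(dx, dy) % (2 * math.pi)


def angular_ranks(P, i):
    """Map each index j > i to its angular rank around P[i], starting at 1."""
    angles = {j: clockwise_angle_from_vertical(P[i], P[j])
              for j in range(i + 1, len(P))}
    distinct = sorted(set(angles.values()))          # O(n log n) sort
    rank_of = {a: r for r, a in enumerate(distinct, start=1)}
    return {j: rank_of[a] for j, a in angles.items()}
```

The sort dominates the cost, matching the $O(n \log n)$ bound stated for this step.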

Figure 3: An example of the angular order of vertices in $P_{i+1,n}$ and the resulting chains.

We define a chain with respect to $p_i$ to be a consecutive sequence $P_{\ell,\ell'} \subseteq P_{i+1,n}$ with
monotonic (non-increasing or non-decreasing) angular order. That is, either
$\rho_i(\ell) \ge \rho_i(\ell+1) \ge \cdots \ge \rho_i(\ell')$ or $\rho_i(\ell) \le \rho_i(\ell+1) \le \cdots \le \rho_i(\ell')$,
with the added constraint that chains cannot cross the vertical ray above $p_i$. Each
segment that does cross is split into two pieces using two "artificial" points on the ray per crossing
segment. The points on the "low" segment portions have rank $\rho_i = 1$ and the identically placed
other points have rank $\rho_i = n + 1$. These points do not increase the complexities by more than a
constant factor and are not mentioned again unless specifically required. Processing $P_{i+1,n}$ into its
chains is done by first computing the angle from vertical for each point and storing that information
with the points. Then the points are sorted by angular order around $p_i$ and $\rho_i(j)$ is computed as
the rank of $p_j$ in the sorted list. Since this algorithm works in the real RAM model, this step
can be done in $O(n \log n)$ time with linear space to store the angles and ranks. Creating a list of
chains is then computable in $O(n)$ time and space by storing the indices of the beginning and end
of each chain encountered while checking points $p_j$ in increasing order from $j = i + 1$ to $j = n$. The
process to identify all chains involves two steps. First all segments are checked to determine if they
intersect with the vertical ray, each in $O(1)$ time. Such an intersection implies that the previous
chain should end and the segment that crosses the ray should be a new chain (note that an artificial
index of $j + \frac{1}{2}$ can denote the point that crosses the vertical). The second check is to determine if the
most recent segment has a different angular direction from the previous segment. If so, the previous
chain "ended" with the previous point and the new chain "begins" with the current segment. Each
chain is oriented from lowest angular order to highest angular order.
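
The second check, splitting where the angular direction reverses, can be sketched as a single linear scan. This is a simplified illustration only: the function name `split_into_chains` is hypothetical, the input is assumed to be the list of angular ranks of $p_{i+1}, \ldots, p_n$ in index order, and the first check (splitting segments that cross the vertical ray and inserting artificial points) is deliberately omitted.

```python
def split_into_chains(ranks):
    """Split a rank sequence into maximal monotone runs.

    Returns (start, end) position pairs into `ranks`. Consecutive chains
    share the point at which the angular direction reverses.
    """
    chains = []
    if len(ranks) < 2:
        return chains
    start = 0
    direction = 0  # +1 increasing, -1 decreasing, 0 not yet determined
    for k in range(1, len(ranks)):
        step = (ranks[k] > ranks[k - 1]) - (ranks[k] < ranks[k - 1])
        if direction == 0:
            direction = step
        elif step != 0 and step != direction:
            chains.append((start, k - 1))  # previous chain ends at previous point
            start = k - 1                  # new chain begins with current segment
            direction = step
    chains.append((start, len(ranks) - 1))
    return chains
```

Each position is visited once, which is consistent with the linear time stated for building the chain list once ranks are available.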

Lemma 1 Consider any chain $P_{\ell,\ell'}$ (w.l.o.g. assume $\ell < \ell'$). With respect to $p_i$, the segment $S_{i,j}$
($i < j \le n$) can have at most one crossing strictly interior to $P_{\ell,\ell'}$.

Proof. Three cases need to be considered. 

Case 1: $\rho_i(\ell) = \rho_i(j)$ or $\rho_i(j) = \rho_i(\ell')$. Note that if $\ell = n$ then no crossing can exist because
at least one end (or all) of $P_{\ell,\ell'}$ is collinear with $S_{i,j}$ and no proper change in sidedness can occur
in this chain to generate a crossing.

Case 2: $\rho_i(j) \notin [\rho_i(\ell), \rho_i(\ell')]$. These cases have no crossings with the chain because $P_{\ell,\ell'}$ is
entirely on one side of $S_{i,j}$. Since either $\rho_i(j) < \rho_i(\ell)$ or $\rho_i(\ell') < \rho_i(j)$, a ray exists that separates
$P_{\ell,\ell'}$ from $S_{i,j}$, and thus no crossings can occur between the segment and the chain.

Case 3: $\rho_i(j) \in (\rho_i(\ell), \rho_i(\ell'))$. Assume that the chain causes at least two crossings. Pick
the lowest index segment for each of the two crossings that are the fewest segments away from
$p_i$. By definition there are no crossings of segments between these two segments. Label the point
with lowest index of these two segments $p_\lambda$ and the point with greatest index $p_{\lambda'}$. Define a possibly
degenerate cone $\Phi$ with a base $p_i$ and rays through $p_\lambda$ and $p_{\lambda'}$. This cone, by definition, separates
the segments from $p_{\lambda+1}$ to $p_{\lambda'-1}$ from the remainder of the chain. Since this sub-chain cannot circle
$p_i$ entirely there must exist one or more points that have a maximum (or minimum) angular index,
which is a contradiction to the definition of the chain. Hence there must be zero or one crossings
only. □



The algorithm for computing the crossing measure on a batch of segments depends on the nature
of $P$. If $P$ is x-monotone, then the chains can be ordered by increasing x-coordinates or equivalently
by the greatest index among the points that define them. Then a segment $S_{i,j}$ intersects any
chain $P_{\ell,\ell'}$ exactly once if the chain's x-coordinates are less than that of $p_j$ and
$\rho_i(j) \in (\rho_i(\ell), \rho_i(\ell'))$ (i.e., Case 3
of Lemma 1). The algorithm maintains one angular order interval
per chain previously included, using the modified segment tree described by van Kreveld et al. [5,
p. 237]. This data structure requires $O(\log n)$ time per insertion and $O(n)$ space. Each point's
crossing number is queried in $O(\log n)$ time, with points examined in order of increasing indices.
Once each chain's points have all been queried the chain's interval is added. Correctness follows
from the fact that no segment considered can have a crossing within any chain it ends, and chains
that span a point's angular order intersect once if the point is sufficiently distant from $p_i$ relative
to the chain. These facts are guaranteed by x-monotonicity and the proof of Lemma 1.

The problem becomes more difficult if we assume that $P$ is simple but not necessarily monotonic
in $x$. While chains describe angular order quite nicely, the non-monotone nature of $P$ does not allow
a consistent implicit ordering of chain boundaries. Thus queries will be of a specific nature: for a
given point $p_j$, we must determine how many chains are closer to $p_i$ and have a lower maximum index
than $j$. Note that chains do not cross and can only intersect at their endpoints due to the
non-overlapping definition of chains and the simplicity of $P$. Therefore, sweeping a ray from $p_i$, initially
vertical, in increasing $\rho_i$ order defines a partial order on chains with respect to their distance from
$p_i$. Using a topological sweep [p. 481] it is possible to determine a unique order that preserves
this partial ordering of chains. Since there are $O(n)$ chains and changes in "neighbours" defining
the partial order occur at chain endpoints, there are $O(n)$ edges in the partial order and this
operation requires $O(n \log n)$ time to determine the events in a sweep and $O(n)$ time to compute
the topological ordering. Without loss of generality assume that the chains closest to $p_i$ have a
lower topological order.

In our algorithm each chain will be labelled with two labels: the maximum index of its defining
points and the topological order. Furthermore, each point $p_j$ will be labelled with the topological
order of the chain to which it belongs (or the minimum of the two if it is in two chains). A sweep
in increasing $\rho_i$ order maintains the set of chains whose range of angular orders properly includes
the current $\rho_i(j)$. Thus to query the number of crossings of $S_{i,j}$ we need to determine from the
current set of chains the number of them whose topological order is strictly less than that of the chain
or chains containing $p_j$ and whose maximum index is less than $j$. Querying the set in this way is
an orthogonal range counting query in $\mathbb{R}^2$, and such queries can be performed in $O(\log^2 n)$ time
and $O(n)$ space with insertion and deletion of chains in $O(\log^2 n)$ time per event on an elementary
pointer machine [5]. The order of operations is as follows: first, build the range counting structure
by inserting all chains that begin at the vertical ray from $p_i$; next, for each unique angular order
delete all chains whose maximum $\rho_i$ order is achieved (this maintains the proper intersection of
$\rho_i$ previously mentioned); compute the crossing number of all points with this angular order by
querying the data structure; finally, add any new chains starting at this $\rho_i$. The artificial points on
the vertical are not queried. Correctness follows from Lemma 1 and the previous discussion.

Figure 4: Regions around $p_i$ that determine a crossing at $p_i$

2.2 Crossings Due to Neighbouring Simplification Segments 

There are two cases of a crossing being generated that involve more than one segment of Q and it 
is these cases we address now. Suppose that pi = qj. Then there is an intersection between P and 
Q at this point, and we must detect if a change in sidedness accompanies this intersection. Assume 
initially that P does not contain any consecutively collinear segments; we will consider the other 
case later. 

We begin with the non-degenerate case where $(p_{i-1}, p_{i+1}, q_{j-1}, q_{j+1})$ are all distinct points. Each
of the points $q_{j-1}$ and $q_{j+1}$ can be in one of four locations: in the cone left of $(p_{i-1}, p_i, p_{i+1})$; in the
cone right of $(p_{i-1}, p_i, p_{i+1})$; on the ray defined by $S_{i,i-1}$; or on the ray defined by $S_{i,i+1}$. These
are labelled in Figure 4 as regions I through IV respectively. In Cases III and IV it may also be
necessary to consider the location of $q_{j-1}$ or $q_{j+1}$ with respect to $S_{i-2,i-1}$ or $S_{i+1,i+2}$.

Within the degenerate "case" where the points may not be unique: if $p_i = q_j$ and $p_{i+1} \ne q_{j+1}$,
then any change in sidedness is handled at $p_i$ and can be detected by verifying the previous side
from the polyline. If, however, $p_{i+1} = q_{j+1}$, then any change in sidedness will be handled further
along in the simplification.

By examining these points it is possible to assign a sidedness to the end of $S_{\pi(j-1),\pi(j)}$ and
the beginning of $S_{\pi(j),\pi(j+1)}$. Note that the sidedness of a point $q_{j-1}$ with respect to $S_{i-2,i-1}$ can
be inferred from the sidedness of $p_{i-2}$ with respect to $S_{\pi(j-1),i}$, and that property is used in
the case of regions III and IV. The assumed lack of consecutive collinear segments requires that
$\{p_{i-2}, p_{i+2}\} \subset I \cup II$ and thus Table 1 is a complete list of the possible cases when $|P| > 5$. For
cases involving III or IV where $i \notin [3, n - 2]$, the case is labelled collinear (we discuss the
consequences of this choice later).

A single crossing occurs if and only if the end of $S_{\pi(j-1),\pi(j)}$ is on the left or right when the
beginning of $S_{\pi(j),\pi(j+1)}$ is the opposite. Furthermore, the end of any simplification $Q_{1,j}$ of $P_{1,i}$ that
ends in $S_{\pi(j-1),\pi(j)}$ is labelled left or right in the same way that the end of $S_{\pi(j-1),\pi(j)}$ is labelled. This
labelling is consistent with the statement that the simplification last approached the polyline $P$
from the side indicated by the labelling. To maintain this invariant in the labelling of the end of
polylines, if $S_{\pi(j-1),\pi(j)}$ is labelled as collinear then the simplification $Q_{1,j}$ needs to have the same
labelling as $Q_{1,j-1}$. As a basis case, the simplifications of $P_{1,2}$ and $P_{1,1}$ are the result of the identity
operation so they must be collinear. Note that a simplification labelled collinear has no crossings.

| Entity | Categorization | Conditions |
|---|---|---|
| End of $S_{\pi(j-1),\pi(j)}$ | collinear (1) | $q_{j-1} = p_{i-1}$ |
| | | $q_{j+1} = p_{i+1}$ |
| | left (2) | $q_{j-1} \in I$ |
| | | $(q_{j-1} \in III) \wedge (p_{i-2} \in II)$ |
| | | $(q_{j-1} \in IV) \wedge (p_{i+2} \in II)$ |
| | right (3) | $q_{j-1} \in II$ |
| | | $(q_{j-1} \in III) \wedge (p_{i-2} \in I)$ |
| | | $(q_{j-1} \in IV) \wedge (p_{i+2} \in I)$ |
| Beginning of $S_{\pi(j),\pi(j+1)}$ | collinear (1) | $q_{j-1} = p_{i-1}$ |
| | | $q_{j+1} = p_{i+1}$ |
| | left (2) | $q_{j+1} \in I$ |
| | | $(q_{j+1} \in III) \wedge (p_{i-2} \in II)$ |
| | | $(q_{j+1} \in IV) \wedge (p_{i+2} \in II)$ |
| | right (3) | $q_{j+1} \in II$ |
| | | $(q_{j+1} \in III) \wedge (p_{i-2} \in I)$ |
| | | $(q_{j+1} \in IV) \wedge (p_{i+2} \in I)$ |

Table 1: Left, right, and collinear labels applied to beginning or end of a segment at $p_i$

The constant number of cases in Table 1 and the constant complexity of the sidedness test
imply that we can compute the number of crossings between a segment and a chain, and
therefore the labelling for the segment, in constant time. Let $\eta(Q_{1,j}, S_{\pi(j-1),i})$ represent the number
of extra crossings (necessarily 0 or 1) introduced at $p_j$ by joining $Q_{1,j}$ and $S_{\pi(j-1),i}$. We have

$$\chi(Q_{1,j}, P_{1,i}) = \chi(Q_{1,j-1}, P_{1,\pi(j-1)}) + \chi(S_{\pi(j-1),i}, P_{\pi(j-1),i}) + \eta(Q_{1,j}, S_{\pi(j-1),i}),$$

which highlights the possibility of computing the optimal simplification incrementally in a dynamic
programming algorithm.

It remains to consider the case of sequential collinear segments. The polyline $P'$ can be simplified
into $P$ by merging sequential collinear segments, effectively removing points of $P'$ without changing
its shape. When joining two segments where $p'_i = q_j$, $p'_{i-1}$ and $p'_{i+1}$ define the regions as before
but there is no longer a guarantee regarding non-collinearity of $p'_{i-2}$ or $p'_{i+2}$ with respect to the
other points. The points $q_{j-1}$ and $q_{j-2}$ are now collinear if and only if either of them is entirely
collinear to the relevant segments of $P$. Our check for equality is changed to a check for equality
or collinearity. We examine the previous and next points of $P'$ that are not collinear to the two
segments $[p'_{i-1}, p'_i]$ and $[p'_i, p'_{i+1}]$. We find such points for every $p'_i$ in a preprocessing step requiring
linear time and space, by scanning the polyline for turns and keeping two queues of previous and
current collinear points.

3 Optimal Crossing Measure Simplification 

In this section we describe our dynamic programming approach to computing a polyline $Q$ that is a
subset of $P$ having minimum size $k$ conditional on maximal crossing measure $\chi(Q, P)$. We compute
$\chi(S_{i,j}, P_{i,j})$ in batches, as described in the previous section. Our algorithm will maintain the best
known simplifications of $P_{1,i}$ for all $i \in [1, n]$ and each of the three possible labellings of the ends.
We refer to these paths as $Q_{\sigma,i}$ where $\sigma$ describes the labelling at $i$: $\sigma = 1$ for collinear, $\sigma = 2$ for
left, or $\sigma = 3$ for right.

To reduce the space complexity we do not explicitly maintain the (potentially exponential-size)
set of all simplifications $Q_{\sigma,i}$. Instead, for each simplification corresponding to $(\sigma, i)$ we maintain:
$\chi(Q_{\sigma,i}, P_{1,i})$ (initially zero); the size of the simplification found $|Q_{\sigma,i}|$ (initially $n + 1$); the starting
index of the last segment added $\beta_{\sigma,i}$ (initially zero); and the end labelling $\tau_{\sigma,i}$ of the best simplification
that the last segment was connected to (initially zero). The initial values described represent
the fact that no simplification is yet known. The algorithm begins by setting the values for the
optimal identity simplification for $P_{1,1}$ to the following values (note $\sigma = 1$).

$$\begin{aligned}
\chi(Q_{1,1}, P_{1,1}) &= 0 \\
|Q_{1,1}| &= 1 \\
\beta_{1,1} &= 1 \\
\tau_{1,1} &= 1
\end{aligned}$$

A total of $n - 1$ iterations are performed, one for each $i \in [1, n - 1]$, where each segment of a batch
$S_{i,j}$, $i < j \le n$, is considered in a possible simplification ending in that segment. Each
iteration begins with the set of simplifications $\{Q_{\sigma,\ell} : \ell \le i\}$ (for all $\sigma$) being optimal, with maximal
values of $\chi(Q_{\sigma,\ell}, P_{1,\ell})$ and minimum size $|Q_{\sigma,\ell}|$ for each of the specified $\sigma$ and $\ell$ combinations. The
iteration proceeds to calculate the crossing numbers of all segments starting at $i$ and ending at a
later index, $\{\chi(S_{i,j}, P_{i,j}) \mid j \in (i, n]\}$, using the method from Section 2. For each of the segments $S_{i,j}$
we compute the sidedness of both the end at $j$ ($\sigma'_j$) and the start at $i$ ($\upsilon'_j$). Using $\upsilon'_j$ and all values
of $\{\sigma : \beta_{\sigma,i} > 0\}$ it is possible to compute $\eta(Q_{\sigma,i}, S_{i,j})$ using just the labellings of the two inputs
(see Table 2). It is also possible to determine the labelling of the end of the concatenated polyline
$\psi(\sigma, \sigma'_j)$ using the labelling of the end of the previous polyline $\sigma$ and the end of the additional
segment $\sigma'_j$ (also shown in Table 2).

Table 2: Tables defining the computation of additional crossings $\eta(\beta_{\sigma,i}, \upsilon'_j)$ due to concatenation
and the end labelling of the concatenated polyline $\psi(\sigma, \sigma'_j)$

With these values computed, the current value of $\chi(Q_{\psi(\sigma,\sigma'_j),j})$ is compared to
$\chi(Q_{\sigma,i}) + \chi(S_{i,j}, P_{i,j}) + \eta(\beta_{\sigma,i}, \upsilon'_j)$, and if the new simplification has a greater or equal number of crossings then
we can compute:
$|Q_{\sigma,i}| + 1$



Correctness of this algorithm follows from the fact that each possible segment ending at $i + 1$
is considered before the $(i + 1)$-st iteration. For each segment and each labelling, at least one
optimal polyline with that labelling and leading to the beginning of that segment must have been
considered, by the inductive assumption. Since the number of crossings in a polyline only depends
on the crossings within the segments and the labellings where the segments meet, the inductive
hypothesis is maintained through the $(i + 1)$-st iteration. It was also trivially true in the basis
case $i = 1$. With the exception of computing the crossing number for all of the segments, the
algorithm requires $O(n)$ time and space to update the remaining information in each iteration. The
final post-processing step is to determine $\sigma_{\max} = \arg\max_{\sigma} \chi(Q_{\sigma,n})$, finding the simplification that
has the best crossing number. We use the $\beta$ and $\tau$ information to reconstruct $Q_{\sigma_{\max},n}$ in $O(k)$
remaining time.

The algorithm requires $O(n)$ space in each iteration, and the $O(n \log^2 n)$ time per iteration needed to
compute the crossings of each batch of segments dominates the remaining time per iteration. Thus for
simple polylines $Q_{\sigma_{\max},n}$ is computable in $O(n^2 \log^2 n)$ time and $O(n)$ space, and for monotonic
polylines it is computable in $O(n^2 \log n)$ time and $O(n)$ space.
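
The objective optimized in this section (maximum crossing number first, then minimum $|Q|$) can be checked on very small x-monotonic inputs with an exhaustive search. The sketch below is not the dynamic programming algorithm described above; it enumerates all subsets and takes exponential time. The names `crossings` and `brute_force_simplify` are hypothetical, `P` is assumed to have at least two points with strictly increasing x, crossings are counted as proper sign changes of residuals with zeros skipped, and ties among optimal subsets are broken by enumeration order.

```python
from itertools import combinations


def crossings(P, q_idx):
    """Proper sign changes of the residuals of P against the polyline P[q_idx]."""
    signs = []
    for a, b in zip(q_idx, q_idx[1:]):
        (xa, ya), (xb, yb) = P[a], P[b]
        for k in range(a + 1, b):
            x, y = P[k]
            # Cross-multiplied residual: same sign as y - y_hat because xb > xa.
            r = (y - ya) * (xb - xa) - (yb - ya) * (x - xa)
            if r != 0:
                signs.append(1 if r > 0 else -1)
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def brute_force_simplify(P):
    """Return (indices of Q, crossing number) with max crossings, then min |Q|."""
    n = len(P)
    best = None
    for size in range(0, n - 1):  # number of interior points kept
        for mid in combinations(range(1, n - 1), size):
            q = (0, *mid, n - 1)
            key = (-crossings(P, q), len(q))
            if best is None or key < best[0]:
                best = (key, q)
    return list(best[1]), -best[0][0]
```

Such a reference is only practical for a handful of points, but it gives a simple way to compare against the output of a faster implementation.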

4 Results and Smooth Shape Approximation 

Our goal was to approximate shape (and noise) in a parameterless fashion. In this section we
present results of applying the simplification to monotonic data with and without noise. We then
describe how we perform a parameterless smooth bootstrap-like operation, and give results for the
median and the 95% confidence intervals (i.e., error approximations). We conclude by showing the
results applied to a spectrum acquired from a Fourier transformed infrared microscope.

Our first point set is given by $p = (x, x^2 + 10 \sin x)$ for 101 equally spaced points $x \in [-10, 10]$.
The maximal-crossing simplification for this point set has 5 points and 7 crossings. We generated
a second point set by adding standard normal noise generated in Matlab with randn to the first
point set. The maximal-crossing simplification of the data with standard normal noise has 19 points
and a crossing number of 54. We generated a third point set from the first by adding heavy-tailed
noise consisting of standard normal noise for 91 data points and standard normal noise multiplied by
ten for the remaining ten points. The maximal-crossing simplification of the signal contaminated by
heavy-tailed noise has 20 points and a crossing number of 50. These results are shown in Figure 5.

As can be seen in Figure 5, the crossing-maximization procedure gives a much closer
approximation to the signal when there is some nonzero amount of noise present to provide opportunities
for crossings. We might expect that in the case of a clean signal, we could obtain a more useful
approximation by artificially adding some noise before computing our maximal-crossing polyline.
However, to do so requires choosing an appropriate distribution for the added noise, and we wish
to keep our procedure parameterless.

The residuals between the data and the optimal crossing approximation form a good first
approximation of the noise within the data. If these residuals are not zero-centered then their median
should be subtracted to provide a zero-centered distribution. We can take the original data points
and subtract, at each point, a value selected uniformly at random with replacement from the
zero-centered residuals. Then by finding the maximal-crossing polyline of the resulting modified data,
we obtain a noise-based approximation. We repeat this procedure for different random selections
of which residuals to apply to which data points. This bootstrap-like approach is similar to
"smooth bootstrap" estimation, which normally would use a parameter-based model of the error
[12]. Our procedure is parameterless. Repeated evaluation of noise-based approximations produces
multiple y values for all x values. Using the median of these and finding the 5th and 95th
percentiles results in an approximation of signal and noise after relatively few iterations. Results from
90 iterations of this calculation applied to a Fourier transformed infrared spectrum are shown in
Figure 6.

Figure 5: The optimal crossing path for $p = (x, x^2 + 10 \sin x)$, without noise, with standard normal
noise and with heavy-tailed (mixed Gaussian) noise.

Figure 6: The data are in black (with crosses) and contain 1610 data points; the median bootstrap
approximation is in red and the green lines are the bootstrapped approximations of the 5th and 95th
percentiles.
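
The resampling procedure described above can be sketched as follows, assuming x-monotonic data. Everything named here is hypothetical: `simplify` stands for any routine that returns the indices of the maximal-crossing simplification of its input (it is a parameter of the sketch, not implemented here), `interpolate` evaluates the simplified polyline, and the percentiles use a simple nearest-rank rule chosen for illustration. The default of 90 iterations mirrors the experiment reported above.

```python
import random
import statistics


def interpolate(P, q_idx, x):
    """Evaluate the polyline through P[q_idx] at abscissa x (x-monotone assumed)."""
    for a, b in zip(q_idx, q_idx[1:]):
        (xa, ya), (xb, yb) = P[a], P[b]
        if xa <= x <= xb:
            return ya + (yb - ya) * (x - xa) / (xb - xa)
    raise ValueError("x outside the range of the polyline")


def bootstrap_bands(P, simplify, iterations=90, seed=0):
    """Return (5th percentile, median, 95th percentile) of the fits at each x of P."""
    rng = random.Random(seed)
    base = simplify(P)
    res = [y - interpolate(P, base, x) for x, y in P]
    med = statistics.median(res)
    centred = [r - med for r in res]  # zero-centred residuals
    curves = []
    for _ in range(iterations):
        # Subtract a residual drawn uniformly with replacement at each point.
        noisy = [(x, y - rng.choice(centred)) for x, y in P]
        q = simplify(noisy)
        curves.append([interpolate(noisy, q, x) for x, _ in noisy])
    bands = []
    for ys in zip(*curves):
        ys = sorted(ys)
        pct = lambda p: ys[min(len(ys) - 1, int(p * len(ys)))]
        bands.append((pct(0.05), statistics.median(ys), pct(0.95)))
    return bands
```

Passing the simplifier in as a function keeps the sketch independent of how the maximal-crossing polyline is actually computed.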

5 Discussion and Conclusions 

The optimal crossing measure simplification is robust to small changes of x- or y-coordinates of any
$p_i$ when the points are in general position. This robustness can be seen by considering that the
crossing number of every simplification depends on the arrangement of lines induced by the line
segments, and any point in general position (by definition) can be moved some $\varepsilon$ without affecting
the combinatorial structure of the arrangement. The simplification is also invariant under affine
transformations because these too do not modify the combinatorial structure of the arrangement.
In the case of x-monotonic polylines, the simplification possesses another useful property: the more
a point is an outlier, the less likely it is to be included in the simplification. In the limit, increasing
the y-coordinate of any point $p_i$ to infinity (x-monotonicity remains unchanged) will remove $p_i$
from the simplification. That is, if $p_i$ is initially included in the simplification, then once $p_i$ moves
sufficiently upward, the two segments of the simplification adjacent to $p_i$ cease to cross any input
segments in $P$.

We discuss an additional improvement achievable by bounding sequence lengths. If a parameter
$m$ is chosen in advance such that we require that the longest segment considered can span at
most $m - 2$ vertices, then with the appropriate changes the algorithm can find the minimum sized
simplification conditional on maximum crossing number and having a longest segment of length at
most $m$ in $O(nm \log^2 m)$ time for simple polylines or $O(nm \log m)$ time for monotonic polylines,
both with linear space. Since long line segments tend to be rare in good simplifications, we can set
$m$ to a relatively small value and still obtain good simplification results while significantly improving
speed.

Finally, this research presents a method to approximate the shape and noise without an implied
model for the data or a need for parameters. Application of the shape and noise approximation in
the monotonic (functional) data case has shown promising results when used in conjunction with
the bootstrap method described here.